Skip to content
This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Task computer "stuck" on subtask #4123

Closed
wants to merge 8 commits into from
Closed

Conversation

maaktweluit
Copy link
Contributor

@maaktweluit maaktweluit commented Apr 18, 2019

My golem node got stuck on a subtask:

provider_state: {'status': 'Computing', 'subtask': {'subtask_id': 'SUB-ABC', 'progress': 0.0, 'seconds_to_timeout': -416547.8222870827, 'running_time_seconds': 419350.25729084015, 'outfilebasename': 'test_task', 'output_format': 'PNG', 'scene_file': 'bmw.blend', 'frames': [0], 'start_task': 5, 'total_tasks': 50}, 'environment': None}

When investigating i think the header was already deleted by the check_max_tasks_per_owner() cleanup, while the task was computing:

2019-04-11 12:25:21 INFO     golem.task.taskcomputer             Starting computation of subtask 'SUB-ABC' (task: 'TASK-DEF', deadline: 1554988323, docker images: [{'repository': 'golemfactory/blender', 'image_id': None, 'tag': '1.4'}])
2019-04-11 12:27:31 WARNING  golem.task.taskkeeper               Too many tasks from 839b...9cba, dropping 1 tasks
2019-04-11 12:30:20 ERROR    golem.task.taskcomputer             No subtask with id 'SUB-ABC'

This PR solves it 2x:

  • Better errors in task_computer, no longer return on error but let it clean up like other failures.
  • Knowledge in task_keeper of what tasks are running, so they are not removed during cleanup.

Blocked by: #4122 ( lint errors )

@maaktweluit maaktweluit self-assigned this Apr 18, 2019
@ghost ghost added the in progress label Apr 18, 2019
Copy link

@Wiezzel Wiezzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the remark about using set instead of list, I would suggest unit-testing this change somehow.

@maaktweluit maaktweluit force-pushed the mwu/no-subtask-id branch 2 times, most recently from 636c455 to b2915b6 Compare April 19, 2019 09:04
@maaktweluit
Copy link
Contributor Author

Rebased to develop now #4122 is solved

@maaktweluit
Copy link
Contributor Author

@Wiezzel thanks for the review!

Changed list to set and extended the tests, please check again :)

@@ -276,6 +276,7 @@ def test_task_limit(frozen_time, self): # pylint: disable=no-self-argument
self.assertIn(tb_id, tk.task_headers)

def test_check_max_tasks_per_owner(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test func got really long. Is it possible to split it?

Copy link

@Wiezzel Wiezzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine

@codecov
Copy link

codecov bot commented Apr 23, 2019

Codecov Report

Merging #4123 into develop will increase coverage by <.01%.
The diff coverage is 76.92%.

@@             Coverage Diff             @@
##           develop    #4123      +/-   ##
===========================================
+ Coverage    88.79%   88.79%   +<.01%     
===========================================
  Files          215      215              
  Lines        18675    18694      +19     
===========================================
+ Hits         16583    16600      +17     
- Misses        2092     2094       +2

@maaktweluit
Copy link
Contributor Author

Duplicate of #4159

@maaktweluit maaktweluit marked this as a duplicate of #4159 May 14, 2019
@ghost ghost removed the in progress label May 14, 2019
@maaktweluit maaktweluit deleted the mwu/no-subtask-id branch June 14, 2019 10:20
# for free to subscribe to this conversation on GitHub. Already have an account? #.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants